Performance Analysis of Compiler-Parallelized Programs on Shared-Memory Multiprocessors
Authors
Abstract
Shared-memory multiprocessor (SMP) machines have become widely available. As the user community grows, so does the importance of compilers that can translate standard, sequential programs onto this machine class. Substantial research has been done to develop sophisticated parallelization techniques, which can detect and exploit parallelism in many real applications. However, the performance of compiler-parallelized applications can be below expectations: the speedups of even fully parallel codes on today's shared-memory multiprocessors can be significantly less than the number of processors. In this paper we investigate the reasons for this behavior, focusing on three specific issues: (1) whether it is appropriate for the preprocessor to express the detected parallelism in the common loop-oriented form, (2) the sources of inefficiency in fully parallel SMP programs that exhibit good cache locality, and (3) the portability of these programs across SMP machines. In our experiments we have extended the Polaris compiler so that it can generate thread-based code directly. We compare the performance of this code with Polaris' loop-parallel OpenMP output form and with the architecture-specific directive languages available on the Sun Enterprise and SGI Origin systems. We have analyzed in detail the performance of several parallel Perfect Benchmarks. Our main findings are as follows. (1) Overall, there is no significant performance disadvantage to the loop-parallel representation. (2) However, substantial performance differences are attributable to instruction efficiency, which is influenced by the data-sharing semantics of parallel constructs. (3) Both the OpenMP and the thread-based program forms are functionally portable, but can yield substantially different performance on the two machines.
† This work was supported in part by DARPA contract #DABT63-95-C-0097 and NSF grants #9703180-CCR and #9872516-EIA. This work is not necessarily representative of the positions or policies of the U.S. Government.
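To make the loop-oriented representation concrete, the sketch below shows a minimal OpenMP parallel loop of the kind a parallelizer such as Polaris might emit. It is an illustration under assumed names (a, b, sum), not code from the paper; the data-sharing clauses on the directive are the construct semantics that, per finding (2), can influence instruction efficiency.

    #include <stdio.h>

    #define N 1000000

    double a[N], b[N];

    int main(void) {
        double sum = 0.0;

        /* Loop-parallel (OpenMP) form: the parallelism is attached to
           the loop itself, and the data-sharing semantics of every
           variable are declared on the directive. The loop index is
           private to each thread, a and b are shared, and sum is a
           reduction variable combined across threads at loop exit. */
        #pragma omp parallel for shared(a, b) reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * b[i];
            sum += a[i];
        }

        printf("sum = %f\n", sum);
        return 0;
    }

In the thread-based form that the extended Polaris generates directly, the loop body would instead be outlined into a routine handed to a thread library, with the iteration bounds partitioned explicitly among the threads rather than declared through directive clauses.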
Related Papers
Performance Modeling and Measurement of Parallelized Code for Distributed Shared Memory Multiprocessors
This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple mod...
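The entry's description of its model is cut off above. As a generic illustration only, a simple overhead-aware speedup estimate for a directive-parallelized program might take the form below; the symbols are assumptions for this sketch, not the paper's notation:

    S(p) = \frac{T_s}{(1 - f)\,T_s + \frac{f\,T_s}{p} + T_o(p)}

Here T_s is the sequential running time, f the fraction of it that is parallelized, p the number of processors, and T_o(p) the aggregate directive overhead (fork/join and barrier costs) incurred by the parallel regions.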
Architectural and Software Support for Executing Numerical Applications on High Performance Computers
Numerical applications require large amounts of computing power. Although shared memory multiprocessors provide a cost-effective platform for parallel execution of numerical programs, parallel processing has not delivered the expected performance on these machines. There are two crucial steps in parallel execution of numerical applications: (1) effective parallelization of an application and (2) ...
Parallelization of NAS Benchmarks for Shared Memory Multiprocessors
This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of port...
A Compiler Optimization Algorithm for Shared-Memory Multiprocessors
This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We ...
Publication date: 2000